{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "722331c8",
   "metadata": {},
   "source": [
    "## 9.4.1 Statistical Analysis"
   ]
  },
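  {
   "cell_type": "markdown",
   "id": "9defdd8c",
   "metadata": {},
   "source": [
    "The statistical analysis covered in this section targets text. In Section 9.2 we built a vocabulary for the 《越女剑》 dataset. A vocabulary lets us mine a lot of information; for example, we can print the tokens with the highest frequencies. First, let's rerun the code that reads the dataset and builds the vocabulary."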
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6ce03138",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import torch\n",
    "import string\n",
    "from zhon.hanzi import punctuation\n",
    "\n",
    "with open('data/越女剑.txt', 'r', encoding='utf-8') as f:\n",
    "    lines = f.readlines()\n",
    "exclude = set(punctuation)\n",
    "lines = [ ''.join(ch for ch in line if ch not in exclude).replace('\\n','') for line in lines]\n",
    "tokens = [list(line) for line in lines ]\n",
    "\n",
    "import collections\n",
    "\n",
    "class Vocab:\n",
    "    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):\n",
    "        if tokens is None:\n",
    "            tokens = []\n",
    "        if reserved_tokens is None:\n",
    "            reserved_tokens = []\n",
    "        # Sort tokens by frequency, most frequent first\n",
    "        counter = count_corpus(tokens)\n",
    "        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],\n",
    "                                   reverse=True)\n",
    "        # The unknown token has index 0\n",
    "        self.idx_to_token = ['<unk>'] + reserved_tokens\n",
    "        self.token_to_idx = {token: idx\n",
    "                             for idx, token in enumerate(self.idx_to_token)}\n",
    "        for token, freq in self._token_freqs:\n",
    "            if freq < min_freq:\n",
    "                break\n",
    "            if token not in self.token_to_idx:\n",
    "                self.idx_to_token.append(token)\n",
    "                self.token_to_idx[token] = len(self.idx_to_token) - 1\n",
    "\n",
    "    def __len__(self):\n",
    "        return len(self.idx_to_token)\n",
    "\n",
    "    def __getitem__(self, tokens):\n",
    "        if not isinstance(tokens, (list, tuple)):\n",
    "            return self.token_to_idx.get(tokens, self.unk)\n",
    "        return [self.__getitem__(token) for token in tokens]\n",
    "\n",
    "    def to_tokens(self, indices):\n",
    "        if not isinstance(indices, (list, tuple)):\n",
    "            return self.idx_to_token[indices]\n",
    "        return [self.idx_to_token[index] for index in indices]\n",
    "\n",
    "    @property\n",
    "    def unk(self):  # Index of the unknown token is 0\n",
    "        return 0\n",
    "\n",
    "    @property\n",
    "    def token_freqs(self):\n",
    "        return self._token_freqs\n",
    "\n",
    "def count_corpus(tokens):\n",
    "    # tokens is either a 1D list or a 2D list\n",
    "    if len(tokens) == 0 or isinstance(tokens[0], list):\n",
    "        # Flatten a list of token lists into a single list\n",
    "        tokens = [token for line in tokens for token in line]\n",
    "    return collections.Counter(tokens)\n",
    "\n",
    "vocab = Vocab(tokens)"
   ]
  },
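  {
   "cell_type": "markdown",
   "id": "8711c0c8",
   "metadata": {},
   "source": [
    "Next, we use the vocabulary to print the 10 most frequent tokens. Because the text is split character by character, each token here is a single Chinese character."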
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "0a818e72",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('剑', 395),\n",
       " ('的', 289),\n",
       " ('一', 260),\n",
       " ('不', 205),\n",
       " ('道', 202),\n",
       " ('士', 195),\n",
       " ('是', 192),\n",
       " ('了', 174),\n",
       " ('人', 155),\n",
       " ('范', 138)]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "corpus = [token for line in tokens for token in line]\n",
    "vocab = Vocab(corpus)\n",
    "vocab.token_freqs[:10]"
   ]
  },
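  {
   "cell_type": "markdown",
   "id": "b92d1637",
   "metadata": {},
   "source": [
    "Among the top 10 tokens, some carry little meaning on their own, such as “的” and “了”; these are the stop words we discussed earlier. Next, we plot the token frequencies to see the distribution more clearly."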
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "984f9ae6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x7fedeae30400>]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8+yak3AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcH0lEQVR4nO3deZDdZZ3v8fenl3RnI52lE5J0YwJEMI4SuD0Iopayg47BKsaBsSQ6zM0sWONCXQWt8urUtUrvHUWtmosTBY1zkUXEIcUwI6vl4ELsSBJIIqQlwSRk6RCyk6W7v/eP83Q4p3NCn17Oms+rquv8fs/vd8755gf9yZPnPOf5KSIwM7PaUlfuAszMbPQ53M3MapDD3cysBjnczcxqkMPdzKwGNZS7AIBp06bFnDlzyl2GmVlVWbFixc6IaM13rCLCfc6cOXR2dpa7DDOzqiLppRMd87CMmVkNcribmdUgh7uZWQ0qONwl1Ut6RtJDaX+upKcldUm6V9KY1N6U9rvS8TlFqt3MzE5gKD33TwLrsva/BtwWEWcCrwI3pvYbgVdT+23pPDMzK6GCwl1SG/B+4HtpX8DFwP3plKXANWl7YdonHb8knW9mZiVSaM/9m8Bngb60PxXYHRE9aX8zMDttzwY2AaTje9L5OSQtltQpqbO7u3t41ZuZWV6DhrukDwA7ImLFaL5xRCyJiI6I6GhtzTsHf1DPb9vHNx55np37D49maWZmVa+QnvtFwAclbQTuITMc8y2gRVL/l6DagC1pewvQDpCOTwJeGcWaj1m/Yx/ffqKLXQeOFOPlzcyq1qDhHhG3RkRbRMwBrgOeiIiPAE8C16bTFgEPpu1laZ90/Iko0h1BhFKNxXh1M7PqNZJ57p8DPiOpi8yY+h2p/Q5gamr/DHDLyEo8sf6PaQOnu5lZtiGtLRMRPwd+nrZfBM7Pc84h4M9HobZB9U/Bcc/dzCxXVX9D9VjP3eFuZpajqsO9v+/uYRkzs1xVHe7uuZuZ5Vfd4V7uAszMKlR1h7s8FdLMLJ/qDvf06DF3M7Nc1R3uHnM3M8urNsK9vGWYmVWcKg/3/jF3x7uZWbbqDvf02OdsNzPLUd3hrtc/UjUzs9dVd7inR4/KmJnlqu5w9weqZmZ5VXe4ez13M7O8qjvcj81zd7qbmWWr7nBPj452M7Nchdwgu1nSckmrJK2R9OXU/gNJGyStTD8LUrskfVtSl6TVks4rWvX+hqqZWV6F3InpMHBxROyX1Ag8Jek/0rH/ERH3Dzj/KmBe+nkHcHt6HHXyeu5mZnkVcoPsiIj9abcx/bxRmi4Efpie9xugRdLMkZd6PE9zNzPLr6Axd0n1klYCO4BHI+LpdOgraejlNklNqW02sCnr6ZtT28DXXCypU1Jnd3f3sIp3tpuZ5VdQuEdEb0QsANqA8yX9CXArcDbwp8AU4HNDeeOIWBIRHRHR0draOrSqE6/nbmaW35Bmy0TEbuBJ4MqI2JqGXg4D3wfOT6dtAdqzntaW2kbd619icrqbmWUrZLZMq6SWtD0WuAz4ff84ujLd52uA59JTlgE3pFkzFwB7ImJrEWr38gNmZidQyGyZmcBSSfVk/jK4LyIekvSEpFYyGbsS+Nt0/sPA1UAXcBD4+KhXnXj5ATOz/AYN94hYDZybp/3iE5wfwE0jL60QXs/dzCyf6v6GqnvuZmZ5VXe492843c3MclR3uMvfUDUzy6e6wz09esjdzCxXdYe7Fw4zM8urusP92MJhZmaWrbrD3TfrMDPLq6rDvZ+j3cwsV1WHu8fczczyq+pwr/OC7mZmeVV1uPdne5+z3cwsR3WHO17P3cwsn+oOd6/nbmaWV3WHe3p0z93MLFd1h7tXhTQzy6uqw93ruZuZ5VfIbfaaJS2XtErSGklfTu1zJT0tqUvSvZLGpPamtN+Vjs8pVvHHZkKamVmOQnruh4GLI+IcYAFwZbo36teA2yLiTOBV4MZ0
/o3Aq6n9tnReUXjM3cwsv0HDPTL2p93G9BPAxcD9qX0pmZtkAyxM+6Tjl0jF6WN7PXczs/wKGnOXVC9pJbADeBT4A7A7InrSKZuB2Wl7NrAJIB3fA0zN85qLJXVK6uzu7h5W8e65m5nlV1C4R0RvRCwA2oDzgbNH+sYRsSQiOiKio7W1dViv4bVlzMzyG9JsmYjYDTwJXAi0SGpIh9qALWl7C9AOkI5PAl4ZjWIH8nruZmb5FTJbplVSS9oeC1wGrCMT8tem0xYBD6btZWmfdPyJKNJcRa/nbmaWX8PgpzATWCqpnsxfBvdFxEOS1gL3SPpfwDPAHen8O4B/ldQF7AKuK0LdORztZma5Bg33iFgNnJun/UUy4+8D2w8Bfz4q1Q2irs5fYjIzy6eqv6Fan8ZlevvKXIiZWYWp7nBPPfde99zNzHJUdbg39Ie7u+5mZjmqOtzr6zPh3uNbMZmZ5ajqcD/Wc3e4m5nlqOpw779BtnvuZma5qjrc+3vufQ53M7McVR3u/bNl3HM3M8tV1eEuifo6eczdzGyAqg53yHyRyT13M7Nc1R/udaK3z/PczcyyVX24N9TJyw+YmQ1Q9eFeX++eu5nZQFUf7g114qjH3M3MclR9uNdJXvLXzGyAmgh3j8qYmeUq5DZ77ZKelLRW0hpJn0ztX5K0RdLK9HN11nNuldQl6XlJVxT1DyDoc8/dzCxHIbfZ6wFujojfSZoIrJD0aDp2W0T8U/bJkuaTubXeW4FZwGOS3hwRvaNZeNb74SF3M7Ncg/bcI2JrRPwube8jc3Ps2W/wlIXAPRFxOCI2AF3kuR3faJF8mz0zs4GGNOYuaQ6Z+6k+nZo+IWm1pDslTU5ts4FNWU/bTJ6/DCQtltQpqbO7u3volSd1kodlzMwGKDjcJU0AfgJ8KiL2ArcDZwALgK3A14fyxhGxJCI6IqKjtbV1KE/NkRlzH/bTzcxqUkHhLqmRTLDfFREPAETE9ojojYg+4Lu8PvSyBWjPenpbaisK99zNzI5XyGwZAXcA6yLiG1ntM7NO+xDwXNpeBlwnqUnSXGAesHz0Sh5YHzjbzcxyFTJb5iLgo8Czklamts8D10taAASwEfgbgIhYI+k+YC2ZmTY3FWumDLjnbmaWz6DhHhFPAcpz6OE3eM5XgK+MoK6COdzNzI5X9d9QlT9QNTM7TtWHu9eWMTM7XvWHe5177mZmA1V/uHvM3czsOFUf7l5bxszseFUf7nVeW8bM7Dg1EO4eljEzG6gGwh3frMPMbICqD3e5525mdpyqD/c6ry1jZnacGgh3ETjdzcyy1US4eyqkmVmuqg93+QbZZmbHqfpwd8/dzOx4NRDu/hKTmdlANRDungppZjZQIbfZa5f0pKS1ktZI+mRqnyLpUUnr0+Pk1C5J35bUJWm1pPOK+QeorxNHevwtJjOzbIX03HuAmyNiPnABcJOk+cAtwOMRMQ94PO0DXEXmvqnzgMXA7aNedZZZLWPZ/OprHpoxM8syaLhHxNaI+F3a3gesA2YDC4Gl6bSlwDVpeyHww8j4DdAy4Gbao2p2y1gOHull/+GeYr2FmVnVGdKYu6Q5wLnA08CMiNiaDm0DZqTt2cCmrKdtTm0DX2uxpE5Jnd3d3UOt+5iJzZnbwO475HA3M+tXcLhLmgD8BPhUROzNPhaZMZEhjYtExJKI6IiIjtbW1qE8NccpYxsB2Hvo6LBfw8ys1hQU7pIayQT7XRHxQGre3j/ckh53pPYtQHvW09tSW1FMSuH+6gGHu5lZv0Jmywi4A1gXEd/IOrQMWJS2FwEPZrXfkGbNXADsyRq+GXVtk8cCsGnXwWK9hZlZ1Wko4JyLgI8Cz0pamdo+D3wVuE/SjcBLwIfTsYeBq4Eu4CDw8dEseKDpE5sB2HngcDHfxsysqgwa7hHxFKATHL4kz/kB3DTCugo2piHzj4+jPZ4KaWbWr+q/oVpfp8wXmXp7y12KmVnFqPpwB2isF0d73XM3M+tX
E+E+pr7OSxCYmWWpjXBvqONIr8PdzKxfbYS7e+5mZjlqItwbG+o46p67mdkxNRHu7rmbmeWqiXBvrHfP3cwsW02E+5iGOg67525mdkzNhLt77mZmr6uNcPeYu5lZjtoId89zNzPLURPh3lgvLxxmZpalJsJ9TEO9e+5mZllqItwb6+UxdzOzLDUR7k0eczczy1HIbfbulLRD0nNZbV+StEXSyvRzddaxWyV1SXpe0hXFKjybv8RkZparkJ77D4Ar87TfFhEL0s/DAJLmA9cBb03P+b+S6ker2BPxVEgzs1yDhntE/ALYVeDrLQTuiYjDEbGBzH1Uzx9BfQXxl5jMzHKNZMz9E5JWp2GbyaltNrAp65zNqe04khZL6pTU2d3dPYIy+odlgr4+T4c0M4Phh/vtwBnAAmAr8PWhvkBELImIjojoaG1tHWYZGcdukt3n3ruZGQwz3CNie0T0RkQf8F1eH3rZArRnndqW2opqTH3mj+FxdzOzjGGFu6SZWbsfAvpn0iwDrpPUJGkuMA9YPrISB3es5+6bZJuZAdAw2AmS7gbeC0yTtBn4n8B7JS0AAtgI/A1ARKyRdB+wFugBboqI3qJUnqXRPXczsxyDhntEXJ+n+Y43OP8rwFdGUtRQvd5zd7ibmUGNfEO1sV4AvmGHmVlSE+He1OBhGTOzbDUR7h6WMTPLVRPhfuwDVYe7mRlQI+HeP8/9qIdlzMyAGgn3xjQsc9g9dzMzoEbCvf8D1YOHiz6l3sysKtREuJ/ROoHmxjqWb3il3KWYmVWEmgj35sZ6zm2fzKrNe8pdiplZRaiJcAeY1TKW7XsPlbsMM7OKUDPhPnNSMzv2HabXa7qbmdVOuM+Y1ExvX7Bz/+Fyl2JmVnY1E+6nTRkHwBO/31HmSszMyq9mwn1BewsAv/qDZ8yYmdVMuE8a28i7503jpVcOlLsUM7Oyq5lwB5gzdTwbdh4gwh+qmtnJbdBwl3SnpB2SnstqmyLpUUnr0+Pk1C5J35bUJWm1pPOKWfxAp00Zx75DPew91FPKtzUzqziF9Nx/AFw5oO0W4PGImAc8nvYBriJz39R5wGLg9tEpszDTJo4BYNeBI6V8WzOzijNouEfEL4BdA5oXAkvT9lLgmqz2H0bGb4CWATfTLqqp45sAeHzd9lK9pZlZRRrumPuMiNiatrcBM9L2bGBT1nmbU9txJC2W1Cmps7u7e5hl5Dp/7hQA9r52dFRez8ysWo34A9XIfHo55E8wI2JJRHREREdra+tIywAya8yc0tzgMXczO+kNN9y39w+3pMf+bw5tAdqzzmtLbSVzythG99zN7KQ33HBfBixK24uAB7Pab0izZi4A9mQN35TErJaxrN26t5RvaWZWcQqZCnk38GvgLEmbJd0IfBW4TNJ64NK0D/Aw8CLQBXwX+PuiVP0G3jZ7Ei+9crDUb2tmVlEaBjshIq4/waFL8pwbwE0jLWokpk1o4rWjvezYd4jpE5vLWYqZWdnU1DdUAc6eORGA+367aZAzzcxqV82F+/vOms6bZ0yg86VXy12KmVnZ1Fy4A5lb7m3a7TVmzOykVZPh/uZTJ/LqwaO8etBTIs3s5FST4T5naubGHRu9/K+ZnaRqM9ynjQfghW37ylyJmVl51GS4z506nhmnNPmuTGZ20qrJcK+rE2+eMZEXtrvnbmYnp5oMd4ALTp/K77ft89ruZnZSqtlwf0v/l5k6/WUmMzv51Gy4v/OMaUxoamDJL170fHczO+nUbLg3N9Zz7X9rY9eBI9zrpQjM7CRTs+EOcMtVZwNw5y83lLkSM7PSqulwb26s59OXvpkXtu/3B6tmdlKp6XAHuOjMqQAs3zDwHt9mZrWr5sP97W0tjKmv45uPvVDuUszMSmbQm3W8EUkbgX1AL9ATER2SpgD3AnOAjcCHI6Js6++Oaajj7JkT2dB9gIhAUrlKMTMrmdHoub8vIhZEREfavwV4PCLmAY+n/bL6cEc7+w73sOZl31vV
zE4OxRiWWQgsTdtLgWuK8B5D8mfnzKK5sY6b71tV7lLMzEpipOEewCOSVkhanNpmRMTWtL0NmJHviZIWS+qU1Nnd3T3CMt7YpLGNnNPWwvod++jr8xeazKz2jTTc3xUR5wFXATdJek/2wXTD7LxpGhFLIqIjIjpaW1tHWMbg3v/2mfQF/ONDax3wZlbzRhTuEbElPe4AfgqcD2yXNBMgPe4YaZGj4fL5pwLwg19t5O/uWlHmaszMimvY4S5pvKSJ/dvA5cBzwDJgUTptEfDgSIscDadOambtP17B9IlN/GzNdh5bu73cJZmZFc1Ieu4zgKckrQKWA/8eEf8JfBW4TNJ64NK0XxHGjWngkU9nRo5ueeBZLyhmZjVr2PPcI+JF4Jw87a8Al4ykqGJqGTeG689v5+7lm7jhzuXcfPlZLGhvKXdZZmajqua/oZrPlz74VmZOaua/1u/kmn/+JT9/viI+FjAzGzUnZbg3NdTzX599H//vxncA8LHv/5ZP3fMM+w4dLXNlZmaj46QMd4CG+jreNW8aP/rv76BlXCP/tvJl3valR/jFC8Wdc29mVgonbbj3e+cZ0/jtFy7lry6aC8ANdy7nc/ev9lx4M6tqJ324AzTW1/HFP5vPw//wbsaNqefezk2862tP0LnRywSbWXVyuGeZP+sUnvniZXS8aTIv7znEtd/5NR/+zq/5cecm/tC9v9zlmZkVbERL/taipoZ6fvy3F7Jq8x5uvm8lyzfuYnnqwS9+z+l89II30T5lXJmrNDN7Y6qEL/J0dHREZ2dnucs4TkTw8p5DrPzjbm760e+Otb9p6jgWLpjNpy6ZR12d14c3s/KQtCJrufXcYw73wuw/3MMja7bx1PqdPPDMFgDG1Ndx4RlT+eA5s5g3YwJvb2spb5FmdlJxuI+yfYeO8vVHXuCRNdt4ec+hY+3tU8Yya9JYvvD+tzjozazoHO5F0tsXvLz7NV565SA/fWYLK17axcZXDgKZoZsF7S2896zMcsbNDfVc/tZTqfcwjpmNEod7iUQEv37xFe5fsZnH1m5n76GenOON9aJ9yjj+8vzTmDahifFNDVz6lum+r6uZDYvDvQyO9PTx8u7XgMzdSn7wyw1s2X2Ix9blLjU8bkw90yc2USdx8dnTeVvbpGPb45s8mcnMTszhXkF27j/MvkM99PYFdzy1gYNHeoiAZatePu7cudPGA/Ansydx6VumAzBtQhMXnTmtpDWbWWVyuFeBfYeO0r3vMAAPrnyZDTsPAPCL9d3sPpi7oFnrxCYmDujVv2XWKVzx1lNP+PrTJzZxwelTR7lqMysnh3sVO9LTx6ZXMx/SHjzcy/d/uYGjA9a9+WXXTnYdODLoa82a1EzzmPpBz6uTeN9ZrbxtiDN+2ieP5dzTJg/pOWY2fG8U7kUb1JV0JfAtoB74XkRUzB2ZqsmYhjrOaJ1wbP8bf7HguHOO9vbxUpqlk8/eQ0f54a820lPgYmgPrd5K147hLbcwZ+q4Ic0IqpN45xlT6ZgzZVjvl+1tsycxJw1lmZ3sitJzl1QPvABcBmwGfgtcHxFr853vnntl2XvoKDv2Hh7Sc3bsO8Q9yzfRO8T/n/7zuW30juIKnGdOnzD4SSM0vqmBG981l2qc1Tp/5imc3lr8a2SlUY6e+/lAV7oVH5LuARYCecPdKsspzY2c0tw4pOecOX0C7zxj6B/0Hjjcw9Y9rw35eQM9u2UPj6/bQbFHGf/QvZ9Vm3bzD3c/U9w3KqJ5JfgL0Ar3F3/azl+/+/RRf91ihftsYFPW/mbgHdknSFoMLAY47bTTilSGVbrxTQ2cOX3iiF/nzOkT+dC5baNQ0RuLCP646yBHevqK/l6jbc3Le3l03XbfGL7CTJvQVJTXLdtE6ohYAiyBzLBMueowGwpJvGlqdY7rz5sxkWvOnV3uMqxEirWe+xagPWu/LbWZmVkJFCvcfwvMkzRX0hjgOmBZkd7LzMwGKMqwTET0SPoE8DMyUyHv
jIg1xXgvMzM7XtHG3CPiYeDhYr2+mZmdmO+hamZWgxzuZmY1yOFuZlaDHO5mZjWoIlaFlNQNvDTMp08Ddo5iOcXkWovDtRaHay2O0az1TRHRmu9ARYT7SEjqPNHCOZXGtRaHay0O11ocparVwzJmZjXI4W5mVoNqIdyXlLuAIXCtxeFai8O1FkdJaq36MXczMzteLfTczcxsAIe7mVkNqupwl3SlpOcldUm6pQLqaZf0pKS1ktZI+mRqnyLpUUnr0+Pk1C5J3071r5Z0XonrrZf0jKSH0v5cSU+neu5NyzUjqSntd6Xjc0pZZ6qhRdL9kn4vaZ2kCyv4un46/fd/TtLdkpor5dpKulPSDknPZbUN+TpKWpTOXy9pUQlr/T/p/4HVkn4qqSXr2K2p1uclXZHVXvScyFdr1rGbJYWkaWm/NNc1Iqryh8xSwn8ATgfGAKuA+WWuaSZwXtqeSOYm4fOB/w3cktpvAb6Wtq8G/gMQcAHwdInr/QzwI+ChtH8fcF3a/g7wd2n774HvpO3rgHvLcG2XAn+dtscALZV4XcncYnIDMDbrmn6sUq4t8B7gPOC5rLYhXUdgCvBiepyctieXqNbLgYa0/bWsWuenDGgC5qZsqC9VTuSrNbW3k1n6/CVgWimva8l+OYtwMS8Efpa1fytwa7nrGlDjg8BlwPPAzNQ2E3g+bf8LcH3W+cfOK0FtbcDjwMXAQ+l/tJ1ZvzjHrm/6n/PCtN2QzlMJr+OkFJga0F6J17X//sFT0rV6CLiikq4tMGdAYA7pOgLXA/+S1Z5zXjFrHXDsQ8BdaTvn97//upYyJ/LVCtwPnANs5PVwL8l1reZhmXw34a6YG0Smf16fCzwNzIiIrenQNmBG2i7nn+GbwGeB/js9TwV2R0RPnlqO1ZmO70nnl8pcoBv4fhpG+p6k8VTgdY2ILcA/AX8EtpK5Viuo3GsLQ7+OlfK791dkesBQgbVKWghsiYhVAw6VpNZqDveKJWkC8BPgUxGxN/tYZP5KLuv8U0kfAHZExIpy1jEEDWT+yXt7RJwLHCAzfHBMJVxXgDRevZDMX0izgPHAlWUtaggq5ToORtIXgB7grnLXko+kccDngS+Wq4ZqDveKvAm3pEYywX5XRDyQmrdLmpmOzwR2pPZy/RkuAj4oaSNwD5mhmW8BLZL6786VXcuxOtPxScArJaiz32Zgc0Q8nfbvJxP2lXZdAS4FNkREd0QcBR4gc70r9drC0K9jWX/3JH0M+ADwkfSXEW9QU7lqPYPMX/Cr0u9ZG/A7SaeWqtZqDveKuwm3JAF3AOsi4htZh5YB/Z98LyIzFt/ffkP69PwCYE/WP4+LJiJujYi2iJhD5ro9EREfAZ4Erj1Bnf31X5vOL1nvLiK2AZsknZWaLgHWUmHXNfkjcIGkcen/h/5aK/La5qmhkOv4M+BySZPTv1QuT21FJ+lKMsOJH4yIgwP+DNel2UdzgXnAcsqUExHxbERMj4g56fdsM5nJFtso1XUtxgcLpfoh86nzC2Q+Df9CBdTzLjL/pF0NrEw/V5MZQ30cWA88BkxJ5wv451T/s0BHGWp+L6/PljmdzC9EF/BjoCm1N6f9rnT89DLUuQDoTNf238jMJqjI6wp8Gfg98Bzwr2RmcFTEtQXuJvNZwFEygXPjcK4jmfHurvTz8RLW2kVmXLr/9+s7Wed/IdX6PHBVVnvRcyJfrQOOb+T1D1RLcl29/ICZWQ2q5mEZMzM7AYe7mVkNcribmdUgh7uZWQ1yuJuZ1SCHu5lZDXK4m5nVoP8PhZPoCqQqu50AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "freqs = [freq for token, freq in vocab.token_freqs]\n",
    "plt.plot(freqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5bbd998a",
   "metadata": {},
   "source": [
    "### 9.4.1.1 Long-Tail Distribution"
   ]
  },
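  {
   "cell_type": "markdown",
   "id": "68411b5e",
   "metadata": {},
   "source": [
    "The frequency plot shows a long-tail distribution. In a long-tail distribution, a small number of items account for most of the total occurrences, while the vast majority of items each occur only rarely.\n",
    "\n",
    "Long-tail distributions are common in search data: in a search engine, a handful of results attract most of the visits, while most results are rarely visited. In natural language processing, the long tail is often more informative than the high-frequency tokens at the head. For this reason, we may remove the high-frequency stop words, or use methods that lower the weight of high-frequency features and raise the weight of low-frequency and unseen features, such as the smoothing methods covered earlier."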
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2525469e",
   "metadata": {},
   "source": [
    "### 9.4.1.2 Zipf's Law"
   ]
  },
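  {
   "cell_type": "markdown",
   "id": "dc020d92",
   "metadata": {},
   "source": [
    "Besides the long-tail distribution, there is another well-known regularity in word frequencies: Zipf's law.\n",
    "\n",
    "Zipf's law is an empirical law published in 1949 by George Kingsley Zipf, a linguist at Harvard University. It states that in natural language, the frequency of a word is inversely proportional to its rank in the frequency table. The most frequent word occurs roughly twice as often as the second most frequent word, which in turn occurs roughly twice as often as the fourth most frequent word. The law is often used as a reference for phenomena that follow a power-law distribution.\n",
    "\n",
    "Let's switch both axes of the frequency curve above to a logarithmic scale and look at the result."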
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "2afb3955",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x7fedea7b3e80>]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8+yak3AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcmUlEQVR4nO3deXxV1b338c/vnJORTEDClBAhBBkEZYggIqhVK2oRp1rBDrYKeqvtrbePfWprn07Xq7Xt05Y6UBBK25dTpa11HrAKCqhMikwKhCkgYwYgBDKt+8cJlCGBhByyz9nn+3698krO2jsnv5ydfLOy9tprm3MOERHxl4DXBYiISOQp3EVEfEjhLiLiQwp3EREfUriLiPiQwl1ExIdCXhcAkJ2d7Xr06OF1GSIiMWXx4sW7nHM5jW2LinDv0aMHixYt8roMEZGYYmYbm9rm6bCMmY01s6kVFRVeliEi4juehrtz7gXn3KTMzEwvyxAR8R2dUBUR8SGFu4iIDyncRUR8SOEuIuJDCncRER/SVEgRER/SVEgRER/SsIyIiA8p3EVEfEjhLiLiQwp3EREfUriLiPiQwl1ExIcU7iIiPqRwFxHxIU/vxGRmY4GxiV0K6f3Dl0/pOZJCQbpmJpPbPoXcrBS6ZaWQ1/BxbvsUOqUnEwxYZAsXEYly5pzzugbyzhzgvj151il9blVNHVvLq9hSXsXW8gOUVlYftT0UMLpkJh8O+7yGPwBH/jFITghG4tsQEWlTZrbYOVfU2LaouIdql4xkvjemb0Sea391LVvLqygpOxT4VWxp+Pi9dbvZtucA9cf8PctOSyI369+9/9wj/gDkZaWSkRLCTL1/EYkdURHukZSaGKKwUzqFndIb3V5bV8+2PQcOB/6Wsiq2VoT/GKzetpc3V+3gYG39UZ+TlhRqCPzwH4DCnDSuHZxHZmpCW3xLIiIt5rtwP5lQMEBe+1Ty2qc2ut05x+7K6qN6/CVlVYeHfpZuLqd8fw2/fO0Txg/L5xsX9KRbVkobfxciIicWd+F+MmZGdloS2WlJnJ2X1eg+K7fuYercdfxx/gZmzt/A1YO6cfvoXvTp0vh/CyIibS0qTqgWFRW5RYsWeV1Gi5WU7Wf6u+t5+oPNVNXU8bm+nbh9dAHDenbQGL2InHYnOqGqcI+Asspq/vLeRmbO30BpZTWD87O4fXQvLuvfWdMwReS0Ubi3karqOmYtKWHa3GI2le6nILsdE0cXcO3gXE23FJGIU7i3sdq6el5dsY0pc9axfMsectKT+PrIHtw8/AwyUzTDRkQiQ+HuEecc89ftZsqcdbyzZhftEoNMGB6eYdM1UzNsRKR1FO5RYMXWCv4wp5iXPv6MgMElfTvTKSOJlIQgyQlBUhODpCQe8XFC+PGh92lJITpnJJMQ1HJAIhKmcI8im0vDM2zeWLmdyupaqqrrjrtoqikBg66ZDVfONiylEJ6zH37fNUvhLxJPojbcDy0cVlhYOHHNmjWe1eG1unrHgZo6qmrqqKpu/P2eqprDyyocWlrhs4qqo5ZSCBhkpiQcfstoeJ+VGn7fJSOZqwflatxfxCeiNtwPiaeeeyTV1NWzreIAm8v2Hw79sspqKqpqGn2rq3dkJIe49YICvn5BDzKSFfIisSzqFw6TU5MQDNC9QyrdOzS+lMKRnHOs2LqH3725ht/M/pQZ89Zz2wU9uWVkD9IV8iK+o557HFq+pYLfzl7D7FXbyUxJYOKontwysidpSfpbLxJLTtRz19m3ODQgN5PHv1bEC3ddQNEZ7fnV659y+W/mMn/dLq9LE5EIUbjHsYF5mUy/5Vxm3TGCxFCACdPe5yfPr6Cqus7r0kSklRTuQlGPDrz87VHccn4PZs7fwFWT32HppjKvyxKRVlC4CwApiUF+cvVZPHHbcA7U1HH9Y/P51WufUFPXvDn4IhJdFO5ylJGF2bx692iuH5LHw2+tZcK099ix54DXZYlI
Cync5TgZyQn88ovn8LubBrF8yx6u+v27vF+82+uyRKQFFO7SpHGDcvnnXSNJTwox4fH3mTa3mGiYOisiJ6dwlxM6s3M6/7xrJJf168z9L6/izieXsPdAjddlichJKNzlpNKTE3jsy0O494q+vLZiO1dNfpdlJeVelyUiJ6Bwl2YxM26/sBfPTDqP2rp6rn9sPtPfXa9hGpEopXCXFinq0YGX/3MUF57ZiZ+/uJKJf17M8i0VGqoRiTJaTERaLCs1kWlfHcof523ggVdWMXvVdgA6tEuke4dUctKSaJ+aQPt2iXRKT+KKgV3JzdKdp0TakhYOk1bZXLqfj7dUsKl0P5tK97O5dD879x6kfH8Npfurqa6tJ2Dwub6d+fJ5+YzunUMgYF6XLeILWvJXTpsTLTnsnKOkrIqnF27imYWbmb1qOx3aJXJBYTajemdzXkFH8tqnYKawF4k09dylTVTX1vP6ym38a9UO5q7Zxa59BwFon5rAgNxMstOSAAgFjAv75HBpv84kJwS9LFkk6rX5nZjM7BrgKiADmO6ce/1E+yvc44tzjlWf7WXp5jI+Lqng4y0V7D1QC8DeAzWU7a8hPTnEF87uxg1DcxmS3169e5FGRCTczWwG8AVgh3NuwBHtY4DfAUHgcefcg0dsaw/8yjl364meW+Euh9TVOxas283fl5TwyvJtVNXU0TO7HT+9+ixGn5njdXkiUSVS4T4a2Af8+VC4m1kQ+BS4DCgBFgLjnXMrG7b/GnjCObfkRM+tcJfG7DtYy6vLtzFlzjrW7tjHLef34OpB3QDo3zVDwzYS9yJyQtU5N9fMehzTPAxY65wrbvhCTwPjzGwV8CDwysmCXaQpaUkhbhiaxxfO7sqDr6xm5vwNzJy/AYC+XdJ59o4Ruv+rSBNaO1smF9h8xOMSYDjwLeBSINPMCp1zU479RDObBEwCyM/Pb2UZ4mfJCeG15r90bne27znAZxUHuO+55dz15FKmf62IUFDX4okc67RMhXTOTQYmn2SfqcBUCA/LnI46xF/6dc2gX9eMw4/v/fvHXPPoPIbkt+eGoXmcnZflXXEiUaa1XZ4tQPcjHuc1tImcVuOH5fOzcWeRHAryt8Ul3PiHBcxbqxt8ixzSoqmQDWPuLx5xQjVE+ITqJYRDfSEwwTm3opnPNxYYW1hYOHHNmjUtLF0kbNe+g9w87X027K7kkn6dCAUC9OmSzoheHRncPUvTKMW3IjVb5ingIiAb2A782Dk33cyuBH5LeCrkDOfc/S0tULNlpLVKK6v5P89+xKbS/RysrWNzaRUAQ/Kz+NbnenNRnxyFvPhOm1/E1FIKd4m08v3VvPDRVqbMKWZLeRUDcjN4dMJQ8js2vlSCSCw6UbhrmoH4UlZqIl8Z0YO377mIh244m5KyKm6e/h7bdbNviROehruZjTWzqRUVFV6WIT6WEAxwY1F3Zn59GKX7qrn64Xf52+ISDtbWeV2ayGmlYRmJG8tKyrnvueUsK6mgXWKQW0cVcPelvTUWLzFLS/6KAGfnZfHcN0cyZ81Onl20mclvrmHn3gPceXEhee01Fi/+ojF3iSuBgHFxn048MmEIt48u4KkPNjPqobf447z1XpcmElHquUtcMjPuvbIfE4bn8/MXV/HTF1ayZFM53duncMPQPApy0rwuUaRVPB1z10VMEg0O1NTx3Wc/4sNN5Wzfc4B65/jBlf24pF9n8jukEtRtASVKaZ67SDPt2HuAH/5jOW+sDN/0u0fHVCaOLuCCwmy6ZaWQoEXKJIoo3EVaoL7e8fanO9hWcZBnFm7io5LwVN0zO6fx5MTzDt8SUMRrCneRU+ScY9HGMlZsqeDBV1fTr2sGs+44X0M1EhV0harIKTIzzu3RgVtG9uSB6waydFN4rvzHJbrwTqKbrlAVaaZrBuXy+f6deeqDTXzxD/P52+ISduzVcgYSnTQsI9ICzjm2lFdx09T3KCmr4pzuWTz3zfN1lat4QsMyIhFiZuS1
T+W174zm/47py0eby3lx2WdelyVyHIW7yClolxTitlE96dslnf98ein///VPqK2r97oskcMU7iKnKCEY4Nk7RnDt4Dwm/2st1zw6jwdfWU1dvfdDnSIKd5FWSE9O4Nc3nsPvxw/m0+37mDJnHYs3lnldlohmy4hEwthzurH4vktJCBqzV233uhwRb8PdOfeCc25SZmaml2WIRER6cgIjemUzdW4xdz6xhCWbyli6qYzSymqvS5M4pFUhRSLoe5f3oUtGErMWl/DSx+FZNBnJISaOKuCr5/cgMyXB4wolXmieu8hpsGb7XkrKq6iprefBV1dTvLOS7LQkbh9dwMTRBV6XJz6hOzGJtLHendPp3TkdgMv6d+atT3Zw/0uruP/lVZRXVTNxVAFZqYkeVyl+ptkyIqeZmfG5vp15/e4LGXNWFx55ax3jp71PNPzXLP6lcBdpI8GA8fsJgxk/rDurPtvD92Yto0YXPslpoqmQIm0oIRjgR1/oz5md03h2cQljfjuXlVv3eF2W+JBOqIp4ZNrcYn7x6mpCQSMjOTyLpk+XdKZ+pYiUxKDH1Uks0M06RKLU6m17+MuCjdQ7x8Gaev6+dAvtEoM8e8f59O+W4XV5EuUU7iIx4h9LS/juXz+i3sGjNw9hQLdM8jumel2WRClNhRSJEdcOzqMwJ50Jj7/HN59YAsC9V/Tl/F7ZnNUtg4Bu7yfNpJ67SBTasfcASzaW8dicYj7aXA7Aty/pzVfOO4OcdN2gW8I0LCMSo/ZX17JwQxm/m/0pSzaV0zO7Hb++8RzO7JxOWpL+8Y53uhOTSIxKTQxx4Zk5TP/aufzgyr6s31XJdY/O5+Zp77Fz70Gvy5MopnAXiQHt2yUycVQBf719BF8+L5+PSioY+eC/2Fy63+vSJEop3EVihJkxrGcH7ruqP/dc3ofqunpGPfQWcz/d6XVpEoV0hapIjElOCHLnxYU8evMQAL464wMef6fY46ok2uiEqkgMm/vpTib9ZRFBM3rmtGPiqALGDcr1uixpIzqhKuJTo8/M4YnbhjOiVzY79x7kf15exQOvrKK6VguSxTvNpRKJcUPP6MDjX+vAS8s+4ycvrOAPc4pJDAYo7JRGwIyL+3bStMk4pGEZER+prq1nxANvsvuI+7ZOGl3AD67s52FVcrpo+QGROJEYCvCv717E7srwHPi7n/mQqXOL6d4+ha+M6OFtcdKmNOYu4jOZqQkU5KRRkJPGQzecA8DDb63l1pkLWbyx1OPqpK0o3EV8rE+XdB664Ww6ZyQzb90ufjt7DW+u2s7HJZp+7HcacxeJE995einPfbgVCN/y7/0fXEJ2mhYhi2WaCiki/M91A3n+rpE8dMPZ1NU7Lvrl25SUafkCv1K4i8SJ1MQQZ+dlcc2gXC7r35l9B2uZ8e4G5q3dRV299//BS2Qp3EXiTGIowNSvDCUnPYkZ89Zz8+Pv88bKbV6XJRGmMXeROLVjzwFKyqv44pQF9MppR8/sdnTJSObHY8/SHZ9iRNSOuWvhMBHvdMpIZkh+e246tzsBM5Zv2cOfFmykeFel16VJBKjnLiIALCsp5+qH5xGw8NDNY18eysV9OnldlpxA1PbcRSR6DOiWyQ+v7Md/XNSL2jrHzHkbqKqu87osOUUKdxEBIBAwJo4u4J7L+9IrJ405n+7kR/9c7nVZcooU7iJynGlfLSI1McjSTWXMXrmdiqoar0uSFlK4i8hx8jumMmFYPut2VnLbnxfxmzc+9bokaSGtCikijbpnTB+uGZzL3c98eLgHD5AQCjCioCOJIfUNo5nCXUQalRQKMiA3k/7dMvjnh1u57c//ntH26y+ew/VD8zysTk5G4S4iJ/TAdQO57YICAOqc45pH5rGspJxhPTvQvUOqx9VJU/R/lYicUGpiiIF5mQzMy2RQ9yy6ZCTzpwUbGfXQW3y4udzr8qQJCncRaZGZ3ziXn4ztD8AGXc0atRTuItIifbtkcO2Q8Hj7//vnckY8
8CY3TlmglSWjjMbcRaTFMlMSuOfyPmzcXUnxzko+2FBKaWU1Oem6+Ue0ULiLyCm58+JCAF5ctpVFG8v447z1ZKclMTg/i8H57T2uThTuItIqvXLSCAWMR99eB0DfLum8+p3RHlclCncRaZV+XTNY9pPPU1Pr+OmLK3h3zS6vSxIU7iISAamJIUiEnLQkdu07yLiH3yU1McTvxg+iU3qy1+XFJc2WEZGIuWJgVy7u04lQMMCC4t2s2LLH65LilsJdRCJmUPcspt9yLr+4/mwA9hzQapJeUbiLSMRlJIdHfH/w948Z+vM3tKqkByIe7mZWYGbTzWxWpJ9bRGJDTnoS91zeh2uH5JLQMEQjbatZ4W5mM8xsh5ktP6Z9jJl9YmZrzez7AM65YufcraejWBGJDWbGnRcX8t/XDOSsbhlUHqz1uqS409ye+0xgzJENZhYEHgGuAPoD482sf0SrE5GYl5oUYt3OfXxxyny+MXMhezUO3yaaFe7OublA6THNw4C1DT31auBpYFxzv7CZTTKzRWa2aOfOnc0uWERiy7hzujEkvz2VB+v41+odrNmxz+uS4kJrxtxzgc1HPC4Bcs2so5lNAQab2b1NfbJzbqpzrsg5V5STk9OKMkQkml3avzNPTjyPHzesJFlVXedxRfEh4hcxOed2A3dE+nlFJLalJobjZuGGUmobVpDM75BKz+x2XpblW60J9y1A9yMe5zW0NZuZjQXGFhYWtqIMEYkF2emJAPx29prDbZ3Sk/jgh5d6VZKvtSbcFwK9zawn4VC/CZjQkidwzr0AvFBUVDSxFXWISAzompnCm9+9kPL94ROqT7y3kReXfeZxVf7VrHA3s6eAi4BsMysBfuycm25mdwGvAUFghnNuxWmrVERiXq+ctMMfv7NmJ9V19dTVO4IB87Aqf2pWuDvnxjfR/jLwckQrEpG4kJwQBOCpDzaRGPz33I4zOqYyvKCjV2X5hqerQmrMXSR+5WalAHDfc0ddG0lyQoDVP7/Ci5J8xdNw15i7SPwae043hvfsQM0R916dOW89095Zr6GaCNB67iLimU4ZR6/13jEtfA/W6tp6UhKDXpTkG1oVUkSixqGx9+raeo8riX3quYtI1Dh0kvXc+2fDMaMyvTul8dK3R3lQVWzSCVURiRqXn9WZbRVVVNe5o9qXbCrjg/WlVNfWkxjSgENz6ISqiESNjmlJ/Nfn+xzX/vg7xXywvpSqmjqFezPpVRKRqHdoXRotOtZ8CncRiXopieGoqqpRuDeXTqiKSNRLSQhH1U1TF5AQPLpPmpWawJMTzyMjOcGL0qKWTqiKSNQ7r6ADNw/PP67n/ln5ARYU72bT7v0MyM30qLropBOqIhL1slITuf/agce1v/XJDhYU76a6TvPij6UxdxGJWUkNM2cO1ijcj6VwF5GYlRQKX/R0sFYnWo+lcBeRmHW4567lCo6j2TIiErMOhfuPnlvOL15dfbjdgO+N6cvlZ3XxqDLvabaMiMSsntnt+OqIMyitrD6q/bUV21iwbrfC3SuaLSMirREKBvjZuAHHtQ/9+RvUxPkMGo25i4jvJAQD1B6z+Fi8UbiLiO+Egqaeu9cFiIhEWmIwcNTt++KRwl1EfCcUNGrifHqkpkKKiO8kBAOs2bGXR99ee9y2c/KyGFmY7UFVbUtTIUXEd3rlpPH8R1t56NVPjtt2RsdU5txzsQdVtS1zzvtxqaKiIrdo0SKvyxARn3DONXrV6g//sZz563ax4N5LPKgq8sxssXOuqLFtGpYREd8xs8M32z5SYihAbZycaNUJVRGJG8EA1CvcRUT8JRRQz11ExHeCAaNO4S4i4i8KdxERH1K4i4j4UChg1NbHx5WrmgopInEjFAhQ76Dfj15tdHvA4GfjBnD90Lw2rizydIWqiMSN64bkUlVTR30TF2/OeHc9q7ftaeOqTg/drENE4kb3Dql8/4q+TW5/8v1N+GVIXmPuIiINzGiyVx9rFO4iIg0CZvgk
2xXuIiKHBAzfTJVUuIuINAiYaVhGRMRvAgHTCVUREb8JWHgteD9QuIuINNCwjIiID4XD3esqIkPhLiLSQPPcRUR8KBgw39ypSQuHiYg0CJixbEsF//3iypPuGwwYNw8/g/yOqW1QWctp4TARkQaDumfx+optPPXBppPuW1ldR3JCkLsvO7MNKms5LRwmItLgN18a1Ox9e977UlSPz2vMXUTkFBhE9To0CncRkVMQ7XPiFe4iIqfADKI32hXuIiKnxKJ8eWCFu4jIKQiPuUdvuivcRUROQcBMwzIiIn5jRlRfzapwFxE5Beq5i4j4kBHdi4wp3EVEToXpIiYREd8JmGm2jIiI3+giJhERHwroIiYREf/RCVURER8yTYUUEfEfMy0/ICLiOwFNhRQR8R8jutdzj/ht9sysHfAoUA287Zx7ItJfQ0TEa+aHnruZzTCzHWa2/Jj2MWb2iZmtNbPvNzRfB8xyzk0Ero5wvSIiUSF8Jyavq2hac3vuM4GHgT8fajCzIPAIcBlQAiw0s+eBPODjht3qIlapiEiUWbihlP965sNWPceE4fkU9egQmYKO0Kxwd87NNbMexzQPA9Y654oBzOxpYBzhoM8DPuQE/xmY2SRgEkB+fn5L6xYR8dRFfXKYu2YnCzeWtup5Pn9WlwhVdLTWjLnnApuPeFwCDAcmAw+b2VXAC019snNuKjAVoKioKIr/uREROd791w70uoQTivgJVedcJfD1SD+viIg0X2umQm4Buh/xOK+hTUREPNaacF8I9DaznmaWCNwEPN+SJzCzsWY2taKiohVliIjIsZo7FfIpYAHQx8xKzOxW51wtcBfwGrAK+KtzbkVLvrhz7gXn3KTMzMyW1i0iIifQ3Nky45tofxl4OaIViYhIq3m6/ICGZURETg9Pw13DMiIip4cWDhMR8SGLhvWIzawCWNPIpkzg2DGbY9uygV2nqbSmNFbX6X6O5u5/ov1auq25bToGzd+nqe0taY/X34Hmfk48HYMs51xOo1udc56/AVOb235sG7AoWuo9nc/R3P1PtF9Lt7WgTcegmfu05Ge9uccgXl5/HYOWfZ/RMizT1DIFjbU3uaRBG4pEDS19jubuf6L9WrotWl9/iN5jcLJ9WvKz3lR7NBwDL17/5n6OjgFRMizTGma2yDlX5HUd8UzHwFt6/b0XjccgWnrurTHV6wJEx8Bjev29F3XHIOZ77iIicjw/9NxFROQYCncRER9SuIuI+JDvwt3M2pnZn8xsmpnd7HU98cjMCsxsupnN8rqWeGRm1zT8/D9jZp/3up54ZGb9zGyKmc0ys//wooaYCHczm2FmO8xs+THtY8zsEzNba2bfb2i+DpjlnJsIXN3mxfpUS46Bc67YOXerN5X6Uwtf/+cafv7vAL7kRb1+1MJjsMo5dwdwIzDSi3pjItyBmcCYIxvMLAg8AlwB9AfGm1l/wneEOnRv17o2rNHvZtL8YyCRN5OWv/73NWyXyJhJC46BmV0NvIRHy6LHRLg75+YCx95ifBiwtqGXWA08DYwjfKPuvIZ9YuL7iwUtPAYSYS15/S3sF8ArzrklbV2rX7X0d8A597xz7grAk+HhWA6/XP7dQ4dwqOcCfweuN7PHiI5LhP2s0WNgZh3NbAow2Mzu9aa0uNDU78C3gEuBG8zsDi8KiyNN/Q5cZGaTzewPeNRzb9admGKJc64S+LrXdcQz59xuwuO94gHn3GRgstd1xDPn3NvA217WEMs99y1A9yMe5zW0SdvRMfCWXn/vRe0xiOVwXwj0NrOeZpYI3AQ873FN8UbHwFt6/b0XtccgJsLdzJ4CFgB9zKzEzG51ztUCdwGvAauAvzrnVnhZp5/pGHhLr7/3Yu0YaOEwEREfiomeu4iItIzCXUTEhxTuIiI+pHAXEfEhhbuIiA8p3EVEfEjhLiLiQwp3EREfUriLiPjQ/wL2eKyVuUKZGgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ax.set_xscale('log', base=10)\n",
    "ax.set_yscale('log', base=10)\n",
    "# Ranks start at 1; x = 0 cannot be placed on a log axis\n",
    "ax.plot(range(1, len(freqs) + 1), freqs)"
   ]
  },
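  {
   "cell_type": "markdown",
   "id": "a1f4c2e7",
   "metadata": {},
   "source": [
    "A short derivation, offered as a sketch, of why a straight line is the shape to look for on these axes. Assume the frequency $f(r)$ of the token at rank $r$ follows a power law. The constant $C$ and the exponent $s$ are symbols introduced here for illustration; $s = 1$ corresponds to the exact inverse proportionality in the statement of Zipf's law.\n",
    "\n",
    "$$f(r) \\approx \\frac{C}{r^{s}} \\quad\\Longrightarrow\\quad \\log f(r) \\approx \\log C - s \\log r$$\n",
    "\n",
    "On log-log axes this is a line with slope $-s$ and intercept $\\log C$. With $s = 1$ we get $f(1)/f(2) = 2$ and $f(2)/f(4) = 2$, which matches the factor-of-two pattern in the statement of the law."
   ]
  },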
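  {
   "cell_type": "markdown",
   "id": "5cd73b5e",
   "metadata": {},
   "source": [
    "The frequency curve is close to a straight line on the log-log plot. Counting single tokens corresponds to a unigram model; let's see what happens with longer n-grams."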
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "80be3db5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(('剑', '士'), 134),\n",
       " (('范', '蠡'), 117),\n",
       " (('阿', '青'), 62),\n",
       " (('吴', '国'), 53),\n",
       " (('青', '衣'), 47),\n",
       " (('勾', '践'), 47),\n",
       " (('衣', '剑'), 46),\n",
       " (('锦', '衫'), 40),\n",
       " (('衫', '剑'), 40),\n",
       " (('长', '剑'), 36)]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Bigram frequencies\n",
    "bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]\n",
    "bigram_vocab = Vocab(bigram_tokens)\n",
    "bigram_vocab.token_freqs[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "95d26488",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(('青', '衣', '剑'), 46),\n",
       " (('衣', '剑', '士'), 46),\n",
       " (('锦', '衫', '剑'), 40),\n",
       " (('衫', '剑', '士'), 40),\n",
       " (('国', '剑', '士'), 27),\n",
       " (('伍', '子', '胥'), 23),\n",
       " (('吴', '国', '剑'), 20),\n",
       " (('名', '青', '衣'), 18),\n",
       " (('范', '蠡', '道'), 18),\n",
       " (('那', '少', '女'), 18)]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Trigram frequencies\n",
    "trigram_tokens = [triple for triple in zip(\n",
    "    corpus[:-2], corpus[1:-1], corpus[2:])]\n",
    "trigram_vocab = Vocab(trigram_tokens)\n",
    "trigram_vocab.token_freqs[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8808d27d",
   "metadata": {},
   "source": [
    "Let's plot the frequency distributions of the different n-grams in a single figure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "d447d7c0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8+yak3AAAACXBIWXMAAAsTAAALEwEAmpwYAAA16UlEQVR4nO3deVxU9frA8c93hhmQVdlEEUFFJRZXxC1LS80yTU0zs8w0u3Wvdav7u5W272bLLdvMLdPKMs3KMisrcy33BVdcyF0BFQVknfP7Y4TUQAaYmTMzPO/Xa14455w583CEh+8853ueozRNQwghhGcx6B2AEEII+5PkLoQQHkiSuxBCeCBJ7kII4YEkuQshhAeS5C6EEB7IS+8AAEJDQ7WYmBi9wxBCCLeyfv36TE3Twspb5xLJPSYmhnXr1ukdhhBCuBWl1J8VrdO1LKOU6qeUmpKdna1nGEII4XF0Te6api3UNO2eoKAgPcMQQgiPIydUhRDCA7lEzV0IUXsUFRVx6NAh8vPz9Q7Fbfj4+NCoUSNMJpPNr5HkLoRwqkOHDhEQEEBMTAxKKb3DcXmappGVlcWhQ4do0qSJza+TsowQwqny8/MJCQmRxG4jpRQhISFV/qQjyV0I4XSS2KumOsdLpkIKIUQl1q1bxwMPPKB3GFUiUyGFEKISycnJTJo0yebtNU3DYrE4MKLKSVlGCFHrpKenk5iYWPb8tdde45lnnqF79+48+uijpKSk0KJFC5YvXw7A0qVLufHGGwHIyMigV69eJCQkcPfddxMdHU1mZibp6em0bNmSESNGkJiYyMGDB7nvvvtITk4mISGBp59+uuz9YmJiGDduHG3atCE5OZkNGzZw3XXX0axZMyZPnmyX71FmywghdPPswm1sP3LGrvuMbxjI0/0Sqv364uJi1qxZw6JFi3j22WdZsmTJReufffZZrrnmGsaNG8fixYuZPn162bq0tDQ++ugjOnXqBMCLL75IcHAwJSUlXHvttWzZsoVWrVoB0LhxYzZt2sRDDz3EyJEjWblyJfn5+SQmJnLvvfdWO/5SktyFEOICgwYNAqB9+/akp6f/bf2KFStYsGABAH369KFevXpl66Kjo8sSO8DcuXOZMmUKxcXFHD16lO3bt5cl9/79+wOQlJRETk4OAQEBBAQE4O3tzenTp6lbt26Nvg9J7kII3dRkhF0TXl5eF9XEL5xm6O3tDYDRaKS4uLhK+/Xz8yv79/79+3nttddYu3Yt9erVY+TIkeW+j8FgKPt36fOqvm95pOYuhKh16tevz4kTJ8jKyqKgoIBvv/3W5td27dqVuXPnAvDjjz9y6tSpcrc7c+YMfn5+BAUFcfz4cb7//nu7xG4rGbkLIWodk8nEU089RUpKCpGRkcTFxdn82qeffpphw4Yxe/ZsOnfuTEREBAEBAeTk5Fy0XevWrWnbti1xcXFERUXRtWtXe38bl6U0TXPqG5YnOTlZk37uQtQOO3bs4IorrtA7jGorKCjAaDTi5eXF6tWrue+++9i0aZPD37e846aUWq9pWnJ528vIXQghquDAgQPccsstWCwWzGYzU6dO1Tukcuma3JVS/YB+5ohYmj++qFr78PEy0iTMj2Zh/jQN9aNZuD/NwvyJDvHFx2S0b8BCiFqvefPmbNy4Ue8wKqVrctc0bSGwsFGLxDFjujWt1j5yC4rZl5nLH/uyWLDxcNlyg4KoYN+/Jf1mYX4E+5mlt4UQwqO5RFkmItCHR/rYfkKjInmFxezLyGVvRg57S7+eyGHlnkwKiv+a9lTX11Ru0m8c7IuXUSYQCSHcn0skd3vxNXuRGBlEYuTFvWosFo3Dp8/9Lekv3Z3BF+sPlW1nMiqiQ/xoGupHbLg/g9pFEhse4OxvQwghasyjkntFDAZFVLAvUcG+dG958brsc0XsuyTp78vM5ZedJ5iybB+jrmzCA9c2x9+7VhwqIYSHqPUZK6iOibaN69G2
cb2LlmflFPDK4p1MWbaPrzcd5vG+8fRr1UBq9UJ4gPT0dG688UZSU1MvWn733Xfz8MMPEx8fr1Nk9iMF5gqE+HszcXBrvvxnF8ICvHlgzkZum/oHu4+f1Ts0IYSDTJs2rUqJ3R5tAhxFknsl2jWux9f/upIXBiSy/egZbnhrOS98u52z+UV6hyaEqIHi4mKGDx/OFVdcweDBg8nLy6N79+6UXlA5ffp0WrRoQUpKCmPGjGHs2LEAjBw5knvvvZeOHTvyyCOPsGbNGjp37kzbtm3p0qULu3btAmDmzJkMGDCAXr16ERMTwzvvvMMbb7xB27Zt6dSpEydPnnTo91fryzK2MBoUt3eK5oakBkxcvJPpK/fzzeYjjL/hCm5q01BKNUJU1/ePwbGt9t1nRBJcP6HSzXbt2sX06dPp2rUro0aN4r333itbd+TIEZ5//nk2bNhAQEAA11xzDa1bty5bf+jQIVatWoXRaOTMmTMsX74cLy8vlixZwvjx45k/fz4AqampbNy4kfz8fGJjY3nllVfYuHEjDz30ELNmzeLBBx+07/d+ARm5V0Gwn5kJN7diwT+7EhHkw4Ofb2LolN/Zecy+/aiFEI53Yb+X22+/nRUrVpStW7NmDVdffTXBwcGYTCaGDBly0WuHDBmC0Wi9SDI7O5shQ4aQmJjIQw89xLZt28q269GjBwEBAYSFhREUFES/fv0Aa5vf8toJ25OM3KuhTVRdFvyzK5+vPcjEH3bSd9IK7uwcw4O9mhPoY9I7PCHchw0jbEe59BN3VT6BX9ja98knn6RHjx4sWLCA9PR0unfvXrbu0la+F7b5dXS9XpJ7NRkNits6Nub6xAhe/XEXH66ylmp6J9QnwNsLP28v/M8//Ly98Pfxwt/biL+3iXp+JsL8vaWcI4SODhw4wOrVq+ncuTOffvopV155JQsXLgSgQ4cOPPjgg5w6dYqAgADmz59PUlJSufvJzs4mMjISsNbZXYUk9xqq52fmpYFJ3NohipcW7eCH1GPkFBRfdEVseQJ9vIgN97/o0Tw8gMi6dTAYJOkL4WgtW7bk3XffZdSoUcTHx3PfffeVJffIyEjGjx9PSkoKwcHBxMXFERQUVO5+HnnkEe68805eeOEF+vbt68xv4bJ0bflb2jgsNjZ2TFpamm5xOEJRiYXcgmJyzj+s/y4hJ7+YjLP57MnIIe14DnszcsjMKSx7ndloINTfTGiANyF+ZkL8vQn19yYmxJdB7Rph9pLTJMK9uUvL35ycHPz9/SkuLmbgwIGMGjWKgQMH6haPW7X8LW0clpycPEbPOBzBZDRQ19dMXV9zpdueyi1kT0YOe07kkJ6ZS0ZOAVk5hWTkFLDz2FmycgopLLEwc1U6Ewe3olWjuo7/BoSo5Z555hmWLFlCfn4+vXv3ZsCAAXqHVCVSlnEB9fzMdPALpkNMcLnrNU1jyY4TPPHVVga8u5Ix3ZryUK8W0tJYCAd67bXX9A6hRuQzvhtQStErvj4/PnQ1tyRH8cGyfVz/1nLW7HfsRRBCCPclyd2NBNUxMeHmVnxyd0eKLRZunbKa2avT9Q5LCOGCJLm7oa6xoSz+91X0aBnOk19vY8L3O7FY9L8XrhDCdUhyd1N+3l58cEd7buvYmMm/7eWhuZsoKC7ROywhhIuQ5O7GvIwGXhyQyH+va8nXm44wcsZazkhDMyEu6/Tp0xf1kblUly5dnBiN40hyd3NKKf7VI5Y3bmnN2vST3DF9jSR4IS6jouRe2g5g1apVVdpfSYlrfmKWqZAeYlC7Rvh7e/HPTzYwYvoaZo9OIUD63AjxN4899hh79+6lTZs2mEwmfHx8qFevHjt37mT37t34+/uTk5ODxWJh7Nix/PLLL0RFRWEymRg1ahSDBw8mJiaGoUOH8tNPP/HII49w9uxZpkyZQmFhIbGxscyePRtfX19GjhxJnTp12LhxIydOnGDGjBnMmjWL1atX07FjR4e2K5Dk7kF6J0Tw7vB2/OuT
DYyYsYZZoyTBC9f2yppX2Hlyp133GRccx6Mpj1a4fsKECaSmprJp0yaWLl1K3759SU1NpUmTJhdt9+WXX5Kens727ds5ceIEV1xxBaNGjSpbHxISwoYNGwDIyspizBjrtZhPPPEE06dP5/777wfg1KlTrF69mm+++Yb+/fuzcuVKpk2bRocOHdi0aRNt2rSx6/dfSsoyHua6hAjeua0dWw9lc+eMNXJTESEqkZKS8rfEDrBixQqGDBmCwWAgIiKCHj16XLR+6NChZf9OTU2lW7duJCUl8cknn1zU9rdfv34opUhKSqJ+/fokJSVhMBhISEhwaNtfGbl7oD6JEbw9rC33z9nI7dP+4KNRKTa1QRDC2S43wnaWC9v3Vvd1I0eO5KuvvqJ169bMnDmTpUuXlq27sM3vpS2AHdn2V0buHur6pAZMvr09O46e5dYpv5OZU6B3SEK4hICAAM6erfxeyF27dmX+/PlYLBaOHz9+UcK+1NmzZ2nQoAFFRUV88skndoy2+iS5e7Ce8fWZPjKZ9Kxchn6wmtTD2RRW0opYCE8XEhJC165dSUxM5L///W+F29188800atSI+Ph4br/9dtq1a1dh29/nn3+ejh070rVrV+Li4hwVepXo2vK3VHJyslZ6U1phf2v2n2TUzLXkFBTjZVA0DfMjOsSPsABvwgO86XlFfRIjy/+hFcLe3KXlL/zV9jcrK4uUlBRWrlxJRESELrG4VcvfMpZiyM2s3mvN/mDysW88HialSTA/PXwVa/afZNexs+w6dpaDJ/NY/+cpTuYW8uaSNG5s1YD/692SmNDq1R+F8EQ33ngjp0+fprCwkCeffFK3xF4drpHcj22FV5tV77V16sGwz6BxJ/vG5GEaBNXhpjaRf1t+Jr+Iqcv2MW35fhZtPUqXZqHc1KYhveLry0lYUetdrs7u6lwjuQdFwQ2PV/11mgZrPoBZA2Dox9C8p91D83SBPib+07sld3SO5uPVf/LVpiP8d94WlIL4BoE0D/dHKUWjenW4o3M04QHyKUkId+CQmrtSagDQFwgEpmua9uPltq9RzT0nAz4eCCd2ws1TIUG/22B5Ak3T2HTwNMvTMlm5J5Oj2floaBw+dQ4vo4GhyVHcc1VTooJ99Q5VuKkdO3YQFxcnN4ivAk3T2LlzZ5Vq7jYnd6XUDOBG4ISmaYkXLO8DvAUYgWmapk24YF094DVN00Zfbt81PqF67jR8OhQOrYEb34T2d1Z/X6Jc6Zm5fLBsL/PWH8Kiwegrm/BYnzi5mbeosv379xMQEEBISIgkeBtomkZWVhZnz57928VW9kruVwE5wKzS5K6UMgK7gV7AIWAtMEzTtO3n178OfKJp2obL7dsus2UK82DuHbBnCfR6Hro+ULP9iXIdzT7Hmz+l8fm6g/Rr3ZC7usYQ5u8tI3lhs6KiIg4dOkR+fr7eobgNHx8fGjVqhMl0cTsRu8yW0TRtmVIq5pLFKcAeTdP2nX+jz4CblFI7gAnA95Uldrsx+8Ktc+DLMfDTk5B/Gq55EmRkYFcNguow4eYkmoT5MeH7nSzcfASjQfHp3R3p2DRE7/CEGzCZTOVe7i/sq6YXMUUCBy94fuj8svuBnsBgpdS95b1QKXWPUmqdUmpdRkZGDcM4z8sMg2dAuxGw/HVY9H9gkYt27E0pxb1XN2PRA9348K4ORNatwyPzt5BX6LhLqYUQVeOQ2TKapk0CJlWyzRRgCljLMnZ7c4MR+k0Cn7qwahLkn4EB74FRuiPaW3zDQOIJpM5gI7dO+Z1hU37nusQI7r6yKWYvufhZCD3V9DfwMBB1wfNG55fpSyno9Rxc+xRsnQuf3wFF5/SOymN1ahrC8wMSKSrRmLh4F899u63yFwkhHKqmI/e1QHOlVBOsSf1W4DZbX6yU6gf0i42NrWEY5e4cuv0HvAOt5ZmPB8OwOeATaP/3EtzRKZo7OkXz8vc7+OC3fRw5nU94gDddY0O5LiFCRvJCOFlVZsvM
AboDocBx4GlN06YrpW4A3sQ6FXKGpmkvVjUIh/eW2TIXFtwLDVrB8PngJyf+HKW4xML4BVvZcOA0GWcLyD5XRPNwf14alESHmGC9wxPCo9hlKqQjOaVx2K7vYe6dUC8GRnwFgQ0d+36CEovGzzuO8+zC7Rw+fY5/XNWUcTe4R8MoIdzB5ZJ77fms3PJ6uH0+nDkMM66DrL16R+TxjAZF74QIfnr4Km7tEMUHy/bx1Ub9T8kIURvomtyVUv2UUlOys7Od84ZNusGdC6EgB2b0gWOpznnfWs7X7MULAxJJjq7HuC+3Wq9ytej/iVEIT6Zrctc0baGmafdU1ADfISLbwV3fW6dMzrwBDq513nvXYl5GA+/d3o6kyCD+74vNXPP6UtKOV343HCFE9dSessyFwuNg1GKoEwyzboK9v+odUa0QHuDDp2M68tatbcgpKOae2evZl5Gjd1hCeKTamdzBemJ11GLr109vgR0L9Y6oVvAyGripTSTvDW/P0exzXPvGbyzcfETvsITwOLU3uQMERMDIb6FBa5g7Aja6xo1ta4OUJsEsf+Qa2kbVZfyXWxn35VbWpZ/UOywhPEbtOqFaHt9guOMraHIVfP1P+P19/WKpZcICvHnr1rY0CfPj281HGDx5NdOW7yMzp0Dv0IRwe7Vnnntligtg/mhreebqx6D7Y9JR0onyCot5YM5Gluw4gdGguLNzDCM6RxMd4is9v4WogFzEZKuSYlj4AGz6BDreC9e9DIbaXblypoLiEpbuymDprgw+W3sATYO+rRrwzrC2kuCFKIdd+rnXCkYv6P8O+ATB7+9ZZ9G0vxNaD7OWb4RDeXsZuS4hgusSIhjROZoZK/bzxfpDXBsXzqB2jfQOTwi3IsPSSxkMcN1LMHAKeAfAD+Ph9ZYwbxTs+036wzvJFQ0CmTi4FfENAnl8QSqr9mTqHZIQbkXXsswFXSHHpKWl6RbHZR1LhQ2zYMtnkJ8N9ZpYbwbSZjgE1Nc7Oo/347Zj3DN7PUrBUzfGc1vHxnh7GfUOSwiXIDV3eyg6B9u/gQ0fwZ8rweAFLfpA+5HQ7BrrFa/CIU6cyaf/Oys5diafR/vEcV/3ZnqHJIRLkORub5lp1iS/6VPIy4KgKGh7u/URJLVhRzibX8RN76yk2KLx83+uxmSUiqIQ0hXS3kKbQ+8X4OEdMPhDCGkGS1+GN5Pg01th7y/gAn80PUmAj4nHro/jwMk8xsxax7nCEr1DEsKlycjdXk7ut9bmN8yCvEwIbQEp90DrW60nZoVdfPrHAR7/aitNQ/0Ye00sA9vKJyVRe0lZxpmK8mHbAljzARzZaL3NX5vhkDLGOsIXNfb91qPc98kGQvzMrHuip8yBF7WWy5ZlXKL9gL2ZfKDNMBjzK4xeAi2ug7XT4O121vu4pv0k0ylr6PqkBkwYlERWbiH7MnP1DkcIl+QSI/eENgnanJ/mVOu1vl6+xATF4GVw4euxzh6H9R/CuhmQcxzMAdYLpspj8IKABtaTtHWj/voangChDriRuJvam5HDta//hsmo+HBkCkF1TCQ0DMRgkFG8qD1cvixTp0kdLfaZ6icuH6MPccFxJIYmEh8ST2JoItGB0RiUi50vLi6EHd/AwTVABce9pBCyD0P2Icg+CIWl/c4V9H0dOox2VrQuTdM03lu6l3d/3UPe+ZOr7RrXZeLg1sSG++scnRDO4fLJPa5VnDZ10dRqvTa7MJttmdvYlrWNHVk7yC/JB8Df5E98SDwJoQkkhCSQGJpIQ7+G7lWf1TQ4d8qa5H99CXYvhmuehG7/kaZm5x3LzmfHsTNs+PMUb/+yBy+D4rmbErmtY2O9QxPC4Vw+udvrhGqxpZh92fvKkn1qZiq7Tu2i2FIMQD3vesSHxpMYkkhiaCIJIQmE+YbV+H2doqQIvv4XbPkcOo+FXs9LU7NL7Dx2hn/MXs/Bk3m8fktrBrSJdK8/5kJUUa1J7uUpLCkk7VQaqZmp1oSflcre03ux
aNaTmuG+4SSGJJIQmlD2Ncjbifd0rQqLBX4YB39Mhta3Qf+3K67d11I5BcX0euM3jmbnM21EMj3jpUWE8Fy1OrmXJ68oj12ndpGamUpqZirbs7aTfia9bH0j/0YkhCYQaA6s0n4Viv6x/Wkd1trOEV9A02DZq/Dri9C4C4S1/Gud0QQdxkBYC8e9vxvIOFtAhxeXEOrvzY8PXUWwn1nvkIRwCJdN7q7UOOxM4Rl2ZO0oG+HvyNrBueJzVdpHXnEeZqOZ+f3mU9/PwSPGtdNg+RvWck2pgjNgqgO3fQFRHRz7/i7ujZ92M+nnNCLr1uGzezoRFeyrd0hC2J3LJvdSnnIR0/7s/Qz9diitQlsxpfcU58/WObkfZg+0Tre8ZRY07+Xc93cx89Yf4v++2Iy/txe+ZiNhAd7MuacTgT4mvUMTwi5c9iImT9MkqAmPpTzGH8f+YOa2mc4PILgJjP4RQmJhzq2wZa7zY3Ahg9s3YvboFPq1bkC35mFsO3KGTi/9zNn8ospfLISbk+RuZwNjB9Iruhdvb3ibbZnbnB+AfziM/A4ad4Yvx8DuH50fgwvp1jyMlwe14vVbWjO2Ryx5hSUMem8Vu4+fpcSi/6dWIRxFkrudKaV4uvPThPqG8ujyR8krynN+ED6BMHwehMfDN2MhN8v5Mbig//RuwbCUxqSdyKH3/5bxyuKdHDypw/+PEE4gyd0BgryDePnKlzlw5gAT1kzQJwiTDwyaAnkn4buHpAUx1j+8Lw5I5MORHWgfXY8py/bRbeKv/LFP/vgJzyPJ3UGSI5K5O+luFuxZwOL0xfoEEZEEPcbD9q9h6xf6xOBiDAZFj7hwpo5I5o1brFNWh075nR+2HSO/SHrEC88hs2UcqMhSxMjvR7I/ez/z+8+ngX8D5wdhKYEPb4ATO2D4F9aSDUDdaDDL9MBluzMYMWMNANfGhfP+7e0xe8mYR7gHmS2jE5PBxISrJmDBwmPLH6PEosPI0GCEge+DpRhm9Ib3Olkf73SAQ+udH4+LuapFGIsf7EaPlmH8vPMED8/dhCsMeISoKUnuDhYVEMXjHR9nw4kNTNs6TZ8ggpvCvcthyEzrY8BkUAb4sA+s+7DW1+PjIgJ5/ZY2NA3z49stR3nhux16hyREjcnNOpygX7N+3NDkBt7f/D6bTmzSJ4iQZpAw0PpoMwz+8RvEdINvH4RF/2ct39RiwX5m5ozpRKi/mekr9jNt+T69QxKiRqTm7iRnC88yZOEQAOb1m4e/2QV6jltKYMkzsGoSxN8Eg6aCl7feUelqz4mz9HxjGf7eXrSOCuL929vLFa3CZUnN3QUEmAOY0G0Cx3KP8fSqpykqcYGrJA1G6P08XPeSdUbNh9fDNw/Awn/Djm9rZbkmNjyA2aNTaNUoiJV7shj6we+cOJOvd1hCVJmM3J3sw9QPeWP9GySEJDDxqok0DnSRm0psmQu/PG+9W1TxOcjPhuiu0PYO663/SjVsWytu91dUYmHgeytJPXyGQe0iubpFGM3C/EmMdNF20KJWksZhLubnP3/mqVVPUaKV8GSnJ+nbtK/eIV2spBg2zoJfXoS8zIvXmQPgvhVQL0aX0Jyt76TlbDtyBgAfk4Ftz/bBKPdpFS5CkrsLOppzlEeXP8rGExsZEDuAcSnj8DW52Lzzwjw4c+Sv5/mnrV0nw+PhrkXWso6Hyy0o5viZfJbsOM5Li3ZyQ1IE7w1vr3dYQgBSc3dJDfwbMOO6Gfyj1T/4es/XDP12KDtP7tQ7rIuZfa0lmNJHo2S44TU4+Dv8/FytqMn7eXvRNMyfEZ1jCPYz89uuDP75yXoycwr0Dk2Iy5LkriMvgxdj245lWu9p5Bblctt3t/Hpjk9d+yKaVrdAuxGw8k2YdxfsWvzXIydD7+gcxsdk5M2hbWgS5seirceYumwfq/dmufb/lajVpCzjIk7mn+SJFU+w/PBy+sT04dkuz7pemaaUplmnTy55Bs7fixaAgAbWdsMhzXQL
zdGKSyy0f2EJ2eess53m/qMzKU2CdY5K1FZSc3cTFs3CjNQZvL3xbZoENuF/Pf5Hk6AmeodVsdMHIPf8CddzJ+HLe8DLx5rgg1047ho6fiafXcfOlvWk2fx0b4LqyFx44XxSc3cTBmXg7qS7mdxzMifzTzLsu2H8/OfPeodVsbqNIbKd9RHbE0Z8DUV58FE/a+L3UPUDfejWPJRuzUMB+Pj3PzmWLXPhhWuR5O6COjfszOc3fk6TwCY8uPRB3lz/JsWWYr3DqlxEEtzxlfVG3bMHwblTekfkMEop3rq1LV4Gxas/7OLfn23UOyQhLiJlGRdWWFLIhDUT+GL3F3SM6MhL3V4i3Ddc77Aql74SZt1krb2HXHLBU0gsXPuUx0yjPHgyjxe+286y3Zlc1SKU9tH1uOcqzz3nIFyLy5ZlakvjsOoyG8081fkpnuvyHJszNjPom0H8mO4G90SN6Qo3TwOjGU7u/+uRmWadZfP7e3pHaDdRwb7cmtKYmFA/1qWf4s0laTKDRrgEGbm7if3Z+xm/fDypWan0a9qPcR3HEWAO0DusqtE0+Ow22LUI/MLgts8h0nMuCJq+Yj/Pf7sdL4Ni9JVNGHfDFXqHJDyczJbxEEWWIqZsmcLULVMJ9w1nXMo46vvVr3D7AFMAUYFRTozQBnknYd10WDMNAupDv0l/rTP5QlgL/WKroaycAmb//idfbTzMydxC1jzeEx+TZ5SfhGuS5O5hNmdsZvzy8Rw4W/mMlPEdxzMsbpgToqqirfNg/ui/L7/tC2jR2/nx2NHjC7byyR8HGNK+Ea8Oaa13OMKDXS65e5W3ULi21mGt+aLfF6w7vu6yt+6bnzafCWsmUN+3Ptc0vsaJEdogabD1Pq65F1zVuvhR+PUF6y0BAZSCxp2hTl1dQqyu/17Xki/WHWLLoWzWpZ8kOUYuchLOJyN3D3au+ByjfxhN2qk0pl83nVZhrfQO6fLWz7T2kr9QhzHQ9zVdwqmJcV9uYc6agwAsf6QHUcEuerWxcGsuO1tGOFYdrzq8fc3bhNQJ4f5f7ufAGRe/sKjdnXDfarjnN+ujUQc47J438X66XwIvDUwC4NstR1my/TibD57WNyhRq0hy93AhdUKY3HMyJVoJDy99mMKSQr1DqphSUD8eGraxPqI6wvFt4Ap3raoiH5ORK2OtV7C+sngnd89ax4D3VnIq14WPv/AoktxrgZigGF7o+gK7Tu3inU3v6B2O7Rq0hpICSPsJTuyAIve6xL9xiC+//bc7C8deyaN94tA02HDgFCUW/UuhwvNJcq8lukd15+bmNzMzdSZrj63VOxzbNGxn/frZMHivE3z9L33jqYboED+SGgWRHFMPgNEfreP9pXt0jkrUBpLca5FHOjxCVEAUj694nLOFZ/UOp3KhsXDnQhgyE1r2tV78VHRO76iqpX3jenx4Vwfq+po4dMo9vwfhXmS2TC2zJWMLI74fgb/ZHx+jDwZlYEiLIdyddDdKufC9Qff+CrMHgF+4ta3w9a9A3A16R1Vlvf/3GwdO5lHP18xdXWOkD42oEZktI8q0CmvFxKsmck3UNXRp2IXowGgmbZzEq+tede2eKDHd4MqHrBc4FebApk/0jqhaHri2Of1bN6SoxMLytMzKXyBENcnIvZazaBZeXfsqH+/4mEHNB/F056cxKBf/m//NA7BtAfQYDwERkDBQ74iq7I7pf7A/M5dRXZvQpnFd2jWup3dIwg3JFaqiQgZl4JEOj+Bn8uODLR/gbfRmXMo41y7RXNEfNnwEix+zPm/QGoKb6htTFTUPD2B5WibPfbudlvUD+OGhq/QOSXgYFx+iCWdQSjG27VjujL+TOTvnMHXrVL1DurzmPWHcYRi9xPr8wO/6xlMNT954BZuf6s3AtpFl92MVwp5k5C7KPJz8MFn5Wby98W3CfcMZEDtA75Aq5u1vbRfsUxeWPAtrLviD1OY2SBmjW2i2UEoR5Guinq+ZjJwCbnpnBUOSo7i9
U7TeoQkPISN3UcagDDzX9Tk6NujI86ufZ3vWdr1DujyDwVp3j0gC3xDr4+xRWDtd78hs1rdVBFe3CCM9K49FW4/qHY7wIJLcxUVMBhMTr5pIcJ1gHl76MNkFLn6XrI7/gNvn/fVoMxwyd7vN1azto4OZMbIDrRoFkVtYcYdPIapKkrv4m2CfYF6/+nWO5x3niRVPuPYUyUtFJIFWAm/EwcRmfz1+fUnvyC7L39uL1MPZtH/+J2b//qfe4QgPYPfkrpRqqpSarpSaZ+99C+dpFdaKh9o9xNJDS1m4b6He4diueS/ocj8kDIL4m6wP7wDY+Z3ekV3W6CubMCwlisISC+vTT+odjvAANp1QVUrNAG4ETmialnjB8j7AW4ARmKZp2gRN0/YBoyW5u7/hVwxnyYElTFgzgU4NOhHuG653SJUz+0HvFy5etnictVe8xWKt07ug5JhgkmOCWZd+inNFUp4RNWfrT/pMoM+FC5RSRuBd4HogHhimlIq3a3RCV0aDkee6PEdhSSHP//683uFUX0gzKMqDGb1hRh/rY9sCvaMql4/JyOq9WdzywWpW7ZUrWEX12ZTcNU1bBlz6WTEF2KNp2j5N0wqBz4CbbH1jpdQ9Sql1Sql1GRkZlb9A6CImKIb7Wt/H0oNLWXV4ld7hVE9sL2jeG0x1wGiy9ojf/LneUZVrSHIjEiODWP/nKX7ZcULvcIQbq8ln1Ejg4AXPDwGRSqkQpdRkoK1SalxFL9Y0bYqmacmapiWHhYXVIAzhaHfE30GkfySvrnv1svdsdVn1omH4F9YOk3cuhOgukH1I76jKNbxjNJ+O6URQHRMFxRa9wxFuzO4XMWmalgXca+/9Cv2YjWYebv8w//ntP7y/+X3ahrclLjiOkDoheodWPUFR8Ocq2LPk7+tMvhDVSffavLeXgfSsXH7bnYHJoEiOCcbs5ZrnC4RrqklyPwxEXfC80fllNlNK9QP6xcbG1iAM4Qy9onvRIaIDH2z5AIA2YW2YfcNsnaOqptAWUHAGPr65/PUjvoGmVzs3pksE+5lZnpZZ1jnylZuTGNqhsa4xCfdSk+S+FmiulGqCNanfCtxWlR1omrYQWJicnOza14oLlFJM7jmZ7VnbWXZoGVO3TmXTiU20CW+jd2hVlzwKGrWHS0tMZ4/B3DtcomTz0agU/szKo8SiccsHq8nMkXuviqqxdSrkHKA7EKqUOgQ8rWnadKXUWOAHrFMhZ2iats1hkQrdmY1m2oS3oUW9Fny26zNmbZ/lnsnd6GXtS3Opghzr11z9T/CH+nsT6u8NgNlo4Gx+sc4RCXdjU3LXNG1YBcsXAYvsGpFweb4mX4a2HMqM1BnM2TkHH6NP2To/kx+9onu5dsvgipj9wKsOpK+w9qmpaJv4m8BgdFpY/j5ebD54mrlrrfMXvE0G+iRG4O3lvBiE+9H1Zh0X1NzHpKWl6RaHqLqMvAz6LujLueK/3w/0g54f0CWyiw5R2cHkbnBsy+W3uet764wbJ7np3ZVsPnj6omVTRyTTK76+02IQrsllb9YhNXf3FeYbxpIhS8gtzC1bVmwpZvDCwfz454/um9xH/wi5FVw8lLELPrkZ8rKcGtLn93QiK9dacz9y+hxDJq+WHvCiUtLPXVRboDmQQHPgRcuubnQ1vxz4hSc6PYGXwQ1/vEx1oG5U+ess5+ve+WecFw/Wq1Yj69YBwGS0lrukRYGojEycFXbVK6YXpwpOseH4Br1DsT+fIOvXAucm9wv5mq1/MM8VyglWcXlyg2xhV3lFeVz9+dUUWYowqvJP+NX1rssX/b8g2CfYydHVUEkxPB8KygAVfSoJqA///APMvo4JwaIR+/giDEphNPx10jqhYSAL/tnVIe8pXJfL1tzlIibP42vy5eVuL7M1c2u567MLspmfNp8NxzfQM7qnk6OrIaMX3PQOZFZw8v/EDkj7AXJPgDnGMSEYFBNvbsXejL/OdaxNP8nGA6fQNM09ZykJh5ATqsLu
ekb3rDBxF5YU8s3eb9iSucX9kjtA29srXpf6pTW5F+Y5NIQhyRefE3j75zTW/3mKYotWVpMXQmruwqnMRjNxwXFszSh/ZO/WzH7Wr0V/nx7qSD4ma/lLGo2JC0lyF06XFJrEtqxt7tlh8nJM1hktFDl25H4pb5P117hAZtCIC7jhXDXh7lqFteLTnZ/S58s+FZ50Le81E6+a6ODIash0/iTqvLv+SvSVCY+H22rWW977fLfIG99ecdFJ1qHJUdx/bfMa7Vu4LzmhKpzu6kZXM7Tl0HKvbi3P7lO7+TH9R1668iXXnjsfkQQd74X8bNu2P5YKuxeDpkENToRe3SKcYSlRF5Vllu3OZMWeTEnutZhMhRQub/7u+Tyz+hl+uPkHGvo31Dsc+1n+Ovz8HDyRAV5mu+76jul/kFtQzJcyPdKjXW4qpNTchcsrTeiHc6p0uwDXZzyf0Evs387XbDRQVKL/wE3oR5K7cHmlyf1o7lGdI7EzByZ3k9FAocyeqdUkuQuX18CvAeCJI3eT9asjkruXgaISSe61mQufnRLCymw0E14nnNnbZ7NoX+W3D/A1+TKpxyTq+7l4S9zSkfuM68Bgqng7ZYBez0HLPjbv2mRUHDiZxzWvLy1b1rFJCC8PSqpmsMLdyGwZ4RbubXMva46uqXS7M4VnWHVkFbtO7XL95N60O7QeBsUFl99u+9fw54oqJfch7aMuKstsPZzNT9uPSXKvRaT9gHALQ1oMYUiLIZVud+DMAfou6MvpgtOOD6qmghrBwMmVbzehceV/AC7RuVkInZv9dTepp79O5evNR6oaoXBjUnMXHiXI29qWN7vAxrnm7sDLp8rJ/VJGg4FimT1Tq0hyFx4lwByAQnlWcjd61/ikq5dRUWyRE6y1iSR34VEMykCgd6B7lGVs5eVd45G7l0FRYpGRe20iyV14nCBzEGd0vFuS3XnZYeRuUBRLcq9VZCqk8Dh1veuy4+QOpm2dZtf9BpoDGdxiMAbl5DGR0QwZO2H5G1V/bUQraN4To8GApsG7v+65qI1Np6YhtGtcz36xCpchUyGFx4kLjmPu7rm8teEtu++7TXgbWtRrYff9XlZYHGz+FH5+tuqv9QuH/6YRE+qLUvDqD7suWt2ucV3pP+OhpHGY8DiaplFose9VnysPr+Tfv/6bj2/4mNZhre2670ppWvVq7j8+Dlu/gMcOAFBQXMKFv+73fryezJwCvr2/m50CFc7msvdQFcIRlFJ4G73tuk8/k/UuS4UOaBVQKaXA5FP11xm94YIbonh7Xdw732yU6ZGeTE6oCmED8/lWAUUlRTpHUgUGw0XJ/VLW6ZGS3D2VJHchbGA2WJO7vcs9DmXwAktxhauNBoNMj/RgktyFsEHpyF2Xskx1VZLcTQa5sMmTSXIXwgalyb2gpGYXEzmVMgIaVJDAjQYlNXcPJsldCBuUlmWKLO5Ucz8/X0Irv+4uNXfPJsldCBu4Z1nm/OyYCkozXlJz92gyz10IG+QW5dLp0054GbwwXe7GGjVgUAZe6PoCPaN72meHq96GH58ArzpcdFnqeYUllgrLMl5Ghdl4wdgv5koY/oV94hJ247Lz3OUKVeEu/Ex+jO84niM5jumJrmkaH23/iLTTafZL7gmDIC+rwpF7bl4hu47l/G152omzhPh5c0NShHXBvqVweIN9YhJOIzfrEMJGw+KGOXT/H23/CItmx9krQZHQ85kKV9cDOpWzfOJ7K/Hz9uKG3h2tCxb9F7bOs19cwimk5i6EizAog32Te7XjuKQ9sDJUeFJWuC5J7kK4CJdJ7gaFRbs0uet/bk5UjSR3IVyEARdJ7uqSqfHKAC4Ql6gaSe5CuAhXGbkb/zZyV5Lc3ZAkdyFchKskd4NSlFyU3I2S3N2QJHchXIQrJfeLrm1Sl+8uKVyTJHchXITrJHew/G22jP5xiaqR5C6EizAoAyUuMOXw7zV3Se7uSJK7EC7CoAy4QjsQVd48dzSZDulmJLkL
4SIMyoAF/UfIRqUuzuOlDchk9O5W5B6qQrgIgzKw6cQmXl37qq5xpGvHOeqVz82fLwIgMmcbUcF1OTrjegI0M32KI/HxMtI6qi6GchqSVZupDnR5AHwC7bfPWkwahwnhIlqFtmLVkVXM261vH5ciLFj8NXblWZ/vNZRgDgighCMUGhTDj62nYbEF7YSx3G6T1WIpgeJzENkeWl5vn33WctI4TAgX8b8e/9M7hMtauHch41eMZ8313/LEvGN8948rSWgYZJ+dH90MH1wlUy7tSGruQgibqPOjdKWsBXn7nl8t/QQgJ23tRZK7EMImiotLMHZN7srggJ3WbpLchRA2MZQmYGWdNaPZc5RdWruXGTl2I8ldCGGTsrLM+ed2vf1q6R8OKcvYjSR3IYRNSssyWlnN3a7Z3fpFRu52I8ldCGGTsrLM+QRs35F7aXKXkbu9SHIXQtjEcD5dKOWAmS1KUpG9yREVQtimbLKMA0buUpaxO0nuQgib/DVytz632DO7S1nG7iS5CyFsUlqOKT2Rat+Bu4zc7U2SuxDCJn/Nc7emdYsjZsvIVEi7keQuhLBJ6VRIVZqA5QpVlybJXQhhk7KyTNnI3a47t36VsozdSHIXQtik7IQqpTV3R0yFlJG7vUhyF0LY5NL57TIV0rVJchdC2OTSsoxd2w/IVEi7k+QuhLBJWVmm9HyqNA5zaZLchRA2KSvLKAfU3KUsY3d2v82eUsoPeA8oBJZqmvaJvd9DCOF8ZV0hS9sP2DMPy1RIu7Np5K6UmqGUOqGUSr1keR+l1C6l1B6l1GPnFw8C5mmaNgbob+d4hRA6Kb2IySGXG0nN3e5sHbnPBN4BZpUuUEoZgXeBXsAhYK1S6hugEbD1/GZyt1shPERpWeabPz/Gp4GFZ3//ilfXG6u0j7gGgfh7l5N2igsgNAS2TYE9n9kjXH0FNbI+bDCk5RDahre1ewg2JXdN05YppWIuWZwC7NE0bR+AUuoz4Casib4RsInLfDJQSt0D3APQuHHjqsYthHCyxgGNaVmvJVkFx6gTmM9ZC5wtrNo+Sk56YzKq8lf6B4KlAAoyah6s3s4UQMEJmza9tvG1DgmhJjX3SODgBc8PAR2BScA7Sqm+wMKKXqxp2hRgCkBycrJ8FhPCxYXUCWFe/3l6hyFsZPcTqpqm5QJ32Xu/QgghbFeTqZCHgagLnjc6v0wIIYTOapLc1wLNlVJNlFJm4Fbgm6rsQCnVTyk1JTs7uwZhCCGEuJStUyHnAKuBlkqpQ0qp0ZqmFQNjgR+AHcBcTdO2VeXNNU1bqGnaPUFBQVWNWwghxGXYOltmWAXLFwGL7BqREEKIGtO1/YCUZYQQwjF0Te5SlhFCCMeQxmFCCOGBlF17Mlc3CKUygD8vWBQEZNv4PBTIdFBol76vvV5zuW0qWlfe8sqWXbreUcfKUcepsu1qcqwqe+5ux6qybRx1rOT3r+JlzviZitY0LazcNZqmudwDmGLrc2Cds+Kw12sut01F68pbXtmyco6bQ46Vo46TI4+VDc/d6lhVto2jjpX8/lW8zFk/UxU9XLUsc2nbgsqeOysOe73mcttUtK685ZUtc/fjVNl2NTlWteln6nLr5VjZts6Vf//K5RJlmZpQSq3TNC1Z7zjcgRwr28mxso0cJ9s5+1i56si9KqboHYAbkWNlOzlWtpHjZDunHiu3H7kLIYT4O08YuQshhLiEJHchhPBAktyFEMIDeVxyV0r5KaU+UkpNVUoN1zseV6aUaqqUmq6UktvrXIZSasD5n6fPlVK99Y7HlSmlrlBKTVZKzVNK3ad3PK7sfK5ap5S60RH7d4vkrpSaoZQ6oZRKvWR5H6XULqXUHqXUY+cXDwLmaZo2Bujv9GB1VpVjpWnaPk3TRusTqb6qeJy+Ov/zdC8wVI949VTFY7VD07R7gVuArnrEq5cq5imAR4G5jorHLZI7MBPoc+ECpZQReBe4HogHhiml4rHeEar03q4lTozRVczE
9mNVm82k6sfpifPra5uZVOFYKaX6A99R+9qBz8TG46SU6gVsB2y7i3Y1uEVy1zRtGXDyksUpwJ7zo89C4DPgJqw36m50fhu3+P7sqYrHqtaqynFSVq8A32uatsHZseqtqj9TmqZ9o2na9UCtKotW8Th1BzoBtwFjlFJ2z1V2v0G2E0Xy1wgdrEm9IzAJeEcp1RedL/91IeUeK6VUCPAi0FYpNU7TtJd1ic51VPQzdT/QEwhSSsVqmjZZj+BcTEU/U92xlka9qX0j9/KUe5w0TRsLoJQaCWRqmmax9xu7c3Ivl6ZpucBdesfhDjRNy8JaRxaXoWnaJKyDBlEJTdOWAkt1DsNtaJo201H7dueyxWEg6oLnjc4vE38nx8o2cpxsJ8fKNrodJ3dO7muB5kqpJkopM3Ar8I3OMbkqOVa2keNkOzlWttHtOLlFcldKzQFWAy2VUoeUUqM1TSsGxgI/ADuAuZqmbdMzTlcgx8o2cpxsJ8fKNq52nKRxmBBCeCC3GLkLIYSoGknuQgjhgSS5CyGEB5LkLoQQHkiSuxBCeCBJ7kII4YEkuQshhAeS5C6EEB5IkrsQQnig/wf1JSk6NezGdQAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Frequencies, sorted from most to least common, for 1-, 2- and 3-grams\n",
    "unigram_freqs = [freq for token, freq in vocab.token_freqs]\n",
    "bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]\n",
    "trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]\n",
    "fig, ax = plt.subplots()\n",
    "# Log-log axes: a Zipf-like distribution shows up as a roughly straight line\n",
    "ax.set_xscale('log', base=10)\n",
    "ax.set_yscale('log', base=10)\n",
    "ax.plot(unigram_freqs, label='unigram')\n",
    "ax.plot(bigram_freqs, label='bigram')\n",
    "ax.plot(trigram_freqs, label='trigram')\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4af972a",
   "metadata": {},
   "source": [
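    "As the plot shows, the bigram and trigram frequencies also roughly follow Zipf's law, and the curve becomes flatter as n grows. At the same time, as n increases, more and more n-grams occur only rarely, which makes modeling with traditional count-based methods increasingly difficult. This is also what creates an opening for deep learning."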
   ]
  },
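  {
   "cell_type": "markdown",
   "id": "b7c41e90",
   "metadata": {},
   "source": [
    "To make the sparsity argument a little more concrete, here is a rough counting sketch. It is an illustration under simplifying assumptions, not a result measured on this dataset. The symbols $|V|$ (vocabulary size) and $N$ (number of tokens in one contiguous sequence) are introduced here only for this illustration.\n",
    "\n",
    "$$|V|^n \\quad \\text{possible distinct n-grams}, \\qquad N - n + 1 \\quad \\text{n-gram positions in the sequence}$$\n",
    "\n",
    "The first quantity grows exponentially in $n$, while the second barely changes. As soon as $|V|^n$ exceeds $N - n + 1$, most possible n-grams cannot appear in the sequence at all, and the roughly fixed number of positions tends to be spread over many more distinct n-grams, each seen only a few times."
   ]
  },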
  {
   "cell_type": "markdown",
   "id": "43bca819",
   "metadata": {},
   "source": [
    "## 9.4.2 Constructing the Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b3e47db",
   "metadata": {},
   "source": [
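    "Unlike the image classification models covered earlier, text data comes with no labels, so to train a model we have to construct a labeled dataset ourselves. Remember the time-series modeling from Section 9.1? There we also built the training set by hand, using the elements that precede position t as the input X and the t-th element as the output y. For text we do the same thing: choose an offset to start from, then walk through the whole corpus to build input and label sequences. We have written the code for this and wrapped it in a DataLoader class, which can be iterated over during training to yield mini-batches."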
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd93410e",
   "metadata": {},
   "outputs": [],
   "source": [
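    "class DataLoader:\n",
    "    def __init__(self, corpus, batch_size, num_steps):\n",
    "        self.corpus = corpus          # full corpus as a flat list of token indices\n",
    "        self.batch_size = batch_size  # number of sequences per mini-batch\n",
    "        self.num_steps = num_steps    # number of time steps (tokens) per sequence\n",
    "\n",
    "    def __iter__(self):\n",
    "        corpus, batch_size, num_steps = self.corpus, self.batch_size, self.num_steps\n",
    "        # Start from a random offset so that each pass partitions the corpus differently\n",
    "        offset = random.randint(0, num_steps)\n",
    "        # Reserve one token for the shifted labels, then truncate to a multiple of batch_size\n",
    "        num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size\n",
    "        Xs = torch.tensor(corpus[offset: offset + num_tokens])\n",
    "        # Labels are the inputs shifted by one position\n",
    "        Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])\n",
    "        Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)\n",
    "        num_batches = Xs.shape[1] // num_steps\n",
    "        # Slice consecutive windows of num_steps columns as mini-batches\n",
    "        for i in range(0, num_steps * num_batches, num_steps):\n",
    "            X = Xs[:, i: i + num_steps]\n",
    "            Y = Ys[:, i: i + num_steps]\n",
    "            yield X, Y"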
   ]
  },
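  {
   "cell_type": "markdown",
   "id": "c58d02fa",
   "metadata": {},
   "source": [
    "As a worked example of the index arithmetic in `__iter__`, take some hypothetical numbers chosen only for illustration: a corpus of 35 tokens, `batch_size = 2`, `num_steps = 5`, and a random offset that happens to be 2.\n",
    "\n",
    "$$\\left\\lfloor \\frac{35 - 2 - 1}{2} \\right\\rfloor \\times 2 = 32 \\ \\text{tokens kept}, \\qquad \\frac{32}{2} = 16 \\ \\text{tokens per row}, \\qquad \\left\\lfloor \\frac{16}{5} \\right\\rfloor = 3 \\ \\text{mini-batches}$$\n",
    "\n",
    "Under these assumptions one pass yields 3 mini-batches, each with `X` and `Y` of shape `(2, 5)`, and the last token of each 16-token row is left unused. Because `Ys` is `Xs` shifted by one position, `Y[b, j]` is the token that follows `X[b, j]` in the corpus."
   ]
  },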
  {
   "cell_type": "markdown",
   "id": "016a2926",
   "metadata": {},
   "source": [
    "In the code for the recurrent neural network model in the next section, we will use this DataLoader class to construct the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f770e2d",
   "metadata": {},
   "source": [
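    "**梗直哥 tip: In this section we learned how to analyze text data statistically and how to construct a dataset from it. Mastering the finer details depends on building up experience through hands-on practice, and it will become familiar over time. Of course, if you want to save a lot of time and clear up the questions that come up as you study, you are welcome to take the course 《梗直哥深度学习：python实战》.**"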
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
